FilmLoverz is a new on-demand streaming platform. It’s simple: users log in with their IMDb account, click “like” on a TV show, series or film, and within minutes the requested media files are in their Plex library, ready to stream.
FilmLoverz has set out to understand its audience through a data-first approach. They plan to use the results of the research to inform product recommendations and communication personalisation. They also want to optimise their CRM management by tailoring messaging to the specific film preferences of distinct audience segments.
To kick off the research, FilmLoverz asked 872 users to rate films they’ve seen and to provide some demographic data in return for a few free downloads. The outcome of the research should be a segmentation model with a reasonable number of segments that can power personalised marketing and be put into production.
# data loading and cleaning
user_demos <-
  read_delim(
    file = 'ml-100k/u.user',
    delim = '|',
    col_names = c('user_id', 'age', 'gender', 'occupation', 'zip_code')
  )
states <- read_csv('simplemaps_uszips_basicv1.78/uszips.csv')
user_demos <-
  user_demos %>%
  left_join(states %>% select(zip, state = state_name),
            by = c("zip_code" = "zip"))
user_demos <-
  user_demos %>%
  filter(!state %in% c("Alaska", "Hawaii"))
user_ratings <-
  read_delim('ml-100k/u.data',
             delim = '\t',
             col_names = c('user_id', 'item_id', 'rating', 'timestamp')) %>%
  mutate(timestamp = as_datetime(timestamp))
user_ratings <- user_ratings[user_ratings$user_id %in% user_demos$user_id,]
mycolors <- c(brewer.pal(name = "RdYlBu", n = 8),
              brewer.pal(name = "RdYlBu", n = 6),
              brewer.pal(name = "RdYlBu", n = 8))
movie_data <-
  read_delim('ml-100k/u.item',
             delim = '|',
             col_names = c('movie_id', 'movie_title', 'release_date',
                           'video_release_date', 'IMDb_URL',
                           'genre_unknown', 'genre_action', 'genre_adventure',
                           'genre_animation', 'genre_childrens', 'genre_comedy',
                           'genre_crime', 'genre_documentary', 'genre_drama',
                           'genre_fantasy', 'genre_horror', 'genre_musical',
                           'genre_mystery', 'genre_romance', 'genre_scifi',
                           'genre_thriller', 'genre_war', 'genre_western')
  ) %>%
  mutate(release_date = year(dmy(release_date))) %>%
  select(-video_release_date, -IMDb_URL)
par(mar = c(4, 4, .1, .1))
# occupation
occuptaion_plot(user_demos)
# age
age_plot(user_demos)
# gender split
gender_plot(user_demos)
# geo_spatial
geo_plot(user_demos)
The sample is made up of 872 users, of which almost 75% are male and the rest female. They live all across the US, although the sample is somewhat biased towards the North East. The most common occupations are student, educator, administrator, librarian and programmer, followed by a healthy array of relatively diverse industries and jobs. The age distribution is centred around 33 years, with a slight skew to the right (mean_age = 33.96; median_age = 31).
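The mean-above-median pattern is the signature of a right skew; a minimal sketch, using a hypothetical ages vector rather than the actual survey data:

```python
import numpy as np

# Hypothetical right-skewed age sample (NOT the survey data):
# a long right tail pulls the mean above the median.
ages = np.array([22, 24, 25, 27, 29, 31, 31, 33, 38, 45, 52, 60])

mean_age = ages.mean()
median_age = np.median(ages)

print(f"mean = {mean_age:.2f}, median = {median_age}")
# mean > median, mirroring the sample (mean 33.96 vs median 31)
```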
Users were asked to rate some of the films they have watched in the past. Below is a table summarising the 10 most frequent films at each score (5 down to 1):
top_ratings <- top_ratings(user_ratings, movie_data)
top_ratings %>%
kbl() %>%
kable_styling(latex_options = "striped")
| n_5 | title_5 | n_4 | title_4 | n_3 | title_3 | n_2 | title_2 | n_1 | title_1 |
|---|---|---|---|---|---|---|---|---|---|
| 303 | Star Wars (1977) | 202 | Contact (1997) | 153 | Liar Liar (1997) | 68 | Liar Liar (1997) | 44 | Liar Liar (1997) |
| 213 | Fargo (1996) | 190 | Return of the Jedi (1983) | 128 | Scream (1996) | 62 | Independence Day (ID4) (1996) | 35 | Jungle2Jungle (1997) |
| 195 | Godfather, The (1972) | 185 | Toy Story (1995) | 118 | Mission: Impossible (1996) | 56 | Saint, The (1997) | 35 | Evita (1996) |
| 190 | Raiders of the Lost Ark (1981) | 163 | Star Wars (1977) | 112 | Air Force One (1997) | 55 | Scream (1996) | 33 | Beavis and Butt-head Do America (1996) |
| 173 | Schindler’s List (1993) | 160 | Air Force One (1997) | 104 | Contact (1997) | 52 | Twister (1996) | 30 | Mars Attacks! (1996) |
| 170 | Pulp Fiction (1994) | 159 | Fargo (1996) | 100 | Star Trek: First Contact (1996) | 52 | Dante’s Peak (1997) | 30 | Event Horizon (1997) |
| 169 | Silence of the Lambs, The (1991) | 157 | Jerry Maguire (1996) | 98 | Independence Day (ID4) (1996) | 52 | Volcano (1997) | 30 | Scream (1996) |
| 166 | Titanic (1997) | 157 | English Patient, The (1996) | 92 | English Patient, The (1996) | 50 | English Patient, The (1996) | 30 | Crash (1996) |
| 160 | Empire Strikes Back, The (1980) | 153 | Scream (1996) | 91 | Devil’s Own, The (1997) | 47 | Evita (1996) | 30 | Saint, The (1997) |
| 160 | Return of the Jedi (1983) | 151 | Silence of the Lambs, The (1991) | 90 | Return of the Jedi (1983) | 47 | Mission: Impossible (1996) | 27 | Cable Guy, The (1996) |
As the table already suggests, the sample tends to ‘overscore’ films, or simply to rate the films users preferred rather than those they didn’t like, as the following plot confirms:
par(mar = c(4, 4, .1, .1))
ratings_plot(user_ratings)
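The skew the plot shows can be quantified as the share of 4- and 5-star ratings; a sketch on toy data (the real figure is computed from `user_ratings`):

```python
import pandas as pd

# Toy ratings skewed toward high scores (hypothetical data)
ratings = pd.Series([5, 5, 4, 4, 4, 3, 3, 2, 1, 5])

# Fraction of ratings at 4 stars or above
high_share = (ratings >= 4).mean()
print(f"share of 4-5 star ratings: {high_share:.0%}")
```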
Implementing a clustering algorithm on the data as-is would be too costly, so we first reduce the dimensionality of the data (i.e. the number of variables per observation) as well as its sparseness. To this end we fit a simple autoencoder on the movie ratings data using TensorFlow. We trained the model for 100 epochs, reaching a validation Mean Absolute Error of ~0.24, which is an acceptable reconstruction error.
# load packages
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from sklearn import model_selection
import matplotlib.pyplot as plt
# load data
clustering_data = pd.read_csv('clustering_data.csv')
movie_cols = [col for col in clustering_data if col.startswith('mov')]
movie_data = clustering_data.loc[:,movie_cols]
movie_data = movie_data.fillna(0)
demo_cols = [col for col in clustering_data if not col.startswith('mov')]
user_id = clustering_data.user_id
demo_data = clustering_data.loc[:,demo_cols].drop('user_id', axis = 1)
# autoencoder
encoding_dim = 100
user_row = tf.keras.Input(shape=(1,movie_data.shape[1]))
encoded = layers.Dense(encoding_dim, activation='relu')(user_row)
decoded = layers.Dense(movie_data.shape[1], activation='sigmoid')(encoded)
autoencoder = tf.keras.Model(user_row, decoded)
# encoder
encoder = tf.keras.Model(user_row, encoded)
# reshape data
x_train, x_test = model_selection.train_test_split(movie_data, train_size = 0.8)
x_train, x_test = x_train.astype('float32').to_numpy(), x_test.astype('float32').to_numpy()
x_train = np.reshape(x_train,(x_train.shape[0],1,x_train.shape[1]))
x_test = np.reshape(x_test,(x_test.shape[0],1,x_test.shape[1]))
# compile and fit the model
autoencoder.compile(optimizer='adam', loss='mae')
history = autoencoder.fit(x_train, x_train,
                          epochs=100,
                          batch_size=10,
                          shuffle=True,
                          validation_data=(x_test, x_test),
                          verbose=0)
# visualise metrics
mae = history.history['loss']
val_mae = history.history['val_loss']
epochs = range(len(mae))
plt.clf()
plt.plot(epochs, mae, 'g', label='Training mae')
plt.plot(epochs, val_mae, 'b', label='Validation mae')
plt.title('Training and Validation Reconstruction MAE')
plt.legend(loc=0)
plt.draw()
fig1 = plt.gcf()
fig1.savefig('mae.png',dpi=100)
Having reduced each user’s movie ratings to an embedding of 100 numbers, we normalise the embeddings and join the demographic data back on to form the complete data set to cluster. We then fit a KMeans algorithm, iterating over the number of clusters and using the elbow method to determine the ideal k, and finally label the original user data with a variable called “cluster”.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
# reshape data and get embeddings
clustering_data2 = movie_data.astype('float32').to_numpy()
clustering_data2 = np.reshape(clustering_data2, (clustering_data2.shape[0],1,clustering_data2.shape[1]))
clustering_data2 = encoder.predict(clustering_data2)
clustering_data2 = np.reshape(clustering_data2, (clustering_data2.shape[0], clustering_data2.shape[2]))
clustering_data2 = pd.DataFrame(normalize(clustering_data2, norm="l1"))
embeddings_df = clustering_data2.join(demo_data)
# use elbow method to determine optimal number of clusters
Sum_of_squared_distances = []
K = range(1, 30)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(embeddings_df)
    Sum_of_squared_distances.append(km.inertia_)
plt.clf()
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Optimal number of clusters')
plt.draw()
fig1 = plt.gcf()
fig1.savefig('elbow.png',dpi=100)
The above visualisation shows the number of clusters on the x-axis and the total within-cluster sum of squared distances (the KMeans inertia) on the y-axis. The elbow method is a somewhat subjective way of selecting the number of clusters at the point where the curve ‘bends’ most (i.e. looks like an elbow). In this case, we are going to fit KMeans with 5 clusters and label each observation with the cluster it belongs to.
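The ‘bend’ can also be located programmatically; one common heuristic (a sketch, not part of the original analysis, using hypothetical inertia values) picks the k with the largest second difference in inertia:

```python
import numpy as np

# Hypothetical inertia values for k = 1..8, with a clear bend at k = 5
inertias = np.array([1000.0, 800.0, 600.0, 400.0, 200.0, 190.0, 185.0, 182.0])
ks = np.arange(1, len(inertias) + 1)

# Second difference measures how sharply the curve flattens at each k;
# its maximum marks the elbow (aligned with ks[1:-1])
second_diff = np.diff(inertias, 2)
elbow_k = int(ks[1:-1][np.argmax(second_diff)])
print(f"elbow at k = {elbow_k}")
```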
# use optimal number of clusters shown by elbow method, fit KMeans and save labels
km = KMeans(n_clusters=5)
clustering_model = km.fit(embeddings_df)
clusters = clustering_model.predict(embeddings_df)
clustering_data['cluster'] = clusters
clustering_data['user_id'] = user_id
clustering_data.to_csv('clustered_data.csv')
This is a visualisation of the attributes of each cluster. Each cluster is made up of the most similar users in terms of gender, age, location and film preferences. The following visualisations depict each cluster’s makeup. The heatmaps shade darker blue the most frequent score given to a selection of films, which is a way of showing each segment’s particular taste in films.
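The quantity those heatmaps shade, the modal (most frequent) score each cluster gives a film, can be sketched with pandas on toy ratings (data and column names hypothetical; the real plots come from a hidden helper):

```python
import pandas as pd

# Toy ratings (hypothetical data)
ratings = pd.DataFrame({
    "cluster":     [0, 0, 0, 1, 1, 1, 1],
    "movie_title": ["Fargo (1996)"] * 7,
    "rating":      [5, 5, 4, 3, 3, 3, 2],
})

# Most frequent score per (cluster, film): the value the heatmaps shade
modal_scores = (
    ratings
    .groupby(["cluster", "movie_title"])["rating"]
    .agg(lambda s: s.mode().iloc[0])
    .unstack("movie_title")
)
print(modal_scores)
```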
par(mar = c(4, 4, .1, .1))
clustered_data <- read_csv('clustered_data.csv')
clusters <- clustered_data %>%
select(user_id, cluster)
demos_movies_clustered <- user_demos %>%
left_join(clusters, by = "user_id") %>%
left_join(user_ratings, by = "user_id") %>%
left_join(movie_data, by = c("item_id"="movie_id")) %>%
select(-timestamp, -zip_code)
# occupation
occuptaion_plot(demos_movies_clustered %>%
filter(cluster == 0,
.preserve = T))
# age
age_plot(demos_movies_clustered %>%
filter(cluster == 0,
.preserve = T))
# gender split
gender_plot(demos_movies_clustered %>%
filter(cluster == 0,
.preserve = T))
# geo_spatial
geo_plot(demos_movies_clustered %>%
filter(cluster == 0,
.preserve = T))
# top 10 movies per score
top_movies_cluster(cluster=0,demos_movies_clustered) %>%
kbl() %>%
kable_styling(latex_options = "striped")
| title_5 | n_5 | title_4 | n_4 | title_3 | n_3 | title_2 | n_2 | title_1 | n_1 |
|---|---|---|---|---|---|---|---|---|---|
| Star Wars (1977) | 90 | Scream (1996) | 65 | Scream (1996) | 41 | Independence Day (ID4) (1996) | 23 | Jungle2Jungle (1997) | 13 |
| Return of the Jedi (1983) | 69 | Chasing Amy (1997) | 60 | Liar Liar (1997) | 38 | Evita (1996) | 18 | Spawn (1997) | 11 |
| Empire Strikes Back, The (1980) | 63 | Contact (1997) | 57 | Mission: Impossible (1996) | 31 | Saint, The (1997) | 18 | Liar Liar (1997) | 10 |
| Titanic (1997) | 61 | Twelve Monkeys (1995) | 52 | Star Trek: First Contact (1996) | 30 | Volcano (1997) | 17 | Black Sheep (1996) | 9 |
| Pulp Fiction (1994) | 59 | Liar Liar (1997) | 45 | Air Force One (1997) | 27 | Spawn (1997) | 16 | Crash (1996) | 9 |
| Shawshank Redemption, The (1994) | 51 | Return of the Jedi (1983) | 42 | Back to the Future (1985) | 27 | Home Alone (1990) | 15 | Kull the Conqueror (1997) | 9 |
| Godfather, The (1972) | 47 | Rock, The (1996) | 42 | Independence Day (ID4) (1996) | 27 | Liar Liar (1997) | 15 | Batman Forever (1995) | 8 |
| Raiders of the Lost Ark (1981) | 47 | Toy Story (1995) | 42 | Toy Story (1995) | 27 | Twister (1996) | 15 | Batman Returns (1992) | 8 |
| Princess Bride, The (1987) | 46 | Indiana Jones and the Last Crusade (1989) | 38 | Broken Arrow (1996) | 26 | Dante’s Peak (1997) | 14 | Beautician and the Beast, The (1997) | 8 |
| Fargo (1996) | 44 | Jerry Maguire (1996) | 37 | Eraser (1996) | 26 | Devil’s Own, The (1997) | 14 | Devil’s Own, The (1997) | 8 |
movie_list <-
demos_movies_clustered %>%
group_by(movie_title) %>%
summarise(
n_ratings = length(rating)
) %>%
filter(n_ratings > 100) %>%
slice_sample(n=30) %>%
pull(movie_title)
movies_per_cluster_plot(0,
movie_list,
demos_movies_clustered)
par(mar = c(4, 4, .1, .1))
# occupation
occuptaion_plot(demos_movies_clustered %>% filter(cluster == 1,
.preserve = T))
# age
age_plot(demos_movies_clustered %>% filter(cluster == 1,
.preserve = T))
# gender split
gender_plot(demos_movies_clustered %>% filter(cluster == 1,
.preserve = T))
# geo_spatial
geo_plot(demos_movies_clustered %>% filter(cluster == 1,
.preserve = T))
# top 10 movies per score
top_movies_cluster(cluster=1,demos_movies_clustered) %>%
kbl() %>%
kable_styling(latex_options = "striped")
| title_5 | n_5 | title_4 | n_4 | title_3 | n_3 | title_2 | n_2 | title_1 | n_1 |
|---|---|---|---|---|---|---|---|---|---|
| Star Wars (1977) | 52 | English Patient, The (1996) | 49 | Liar Liar (1997) | 27 | Scream (1996) | 19 | Evita (1996) | 12 |
| Fargo (1996) | 44 | Contact (1997) | 42 | Air Force One (1997) | 26 | English Patient, The (1996) | 16 | Liar Liar (1997) | 10 |
| Godfather, The (1972) | 38 | Fargo (1996) | 34 | Mission: Impossible (1996) | 25 | Grumpier Old Men (1995) | 13 | Scream (1996) | 10 |
| Raiders of the Lost Ark (1981) | 29 | Star Wars (1977) | 33 | Murder at 1600 (1997) | 24 | Liar Liar (1997) | 12 | Mars Attacks! (1996) | 8 |
| Schindler’s List (1993) | 29 | Ulee’s Gold (1997) | 33 | Jerry Maguire (1996) | 21 | Dante’s Peak (1997) | 10 | Beavis and Butt-head Do America (1996) | 7 |
| Titanic (1997) | 29 | Dead Man Walking (1995) | 32 | Devil’s Own, The (1997) | 20 | George of the Jungle (1997) | 10 | Crash (1996) | 6 |
| Casablanca (1942) | 28 | Return of the Jedi (1983) | 32 | English Patient, The (1996) | 20 | Mission: Impossible (1996) | 10 | English Patient, The (1996) | 6 |
| English Patient, The (1996) | 27 | Sense and Sensibility (1995) | 31 | Scream (1996) | 20 | Birdcage, The (1996) | 9 | George of the Jungle (1997) | 6 |
| Graduate, The (1967) | 27 | Air Force One (1997) | 30 | Conspiracy Theory (1997) | 19 | Murder at 1600 (1997) | 9 | Trainspotting (1996) | 6 |
| L.A. Confidential (1997) | 27 | Godfather, The (1972) | 29 | Contact (1997) | 19 | Willy Wonka and the Chocolate Factory (1971) | 9 | Volcano (1997) | 6 |
movies_per_cluster_plot(1,
movie_list,
demos_movies_clustered)
par(mar = c(4, 4, .1, .1))
# occupation
occuptaion_plot(demos_movies_clustered %>% filter(cluster == 2,
.preserve = T))
# age
age_plot(demos_movies_clustered %>% filter(cluster == 2,
.preserve = T))
# gender split
gender_plot(demos_movies_clustered %>% filter(cluster == 2,
.preserve = T))
# geo_spatial
geo_plot(demos_movies_clustered %>% filter(cluster == 2,
.preserve = T))
# top 10 movies per score
top_movies_cluster(cluster=2,demos_movies_clustered) %>%
kbl() %>%
kable_styling(latex_options = "striped")
| title_5 | n_5 | title_4 | n_4 | title_3 | n_3 | title_2 | n_2 | title_1 | n_1 |
|---|---|---|---|---|---|---|---|---|---|
| Star Wars (1977) | 109 | Toy Story (1995) | 83 | Liar Liar (1997) | 61 | Independence Day (ID4) (1996) | 24 | Mars Attacks! (1996) | 14 |
| Fargo (1996) | 75 | Return of the Jedi (1983) | 74 | Scream (1996) | 45 | Broken Arrow (1996) | 22 | Saint, The (1997) | 14 |
| Raiders of the Lost Ark (1981) | 74 | Star Trek: First Contact (1996) | 72 | Mission: Impossible (1996) | 42 | Liar Liar (1997) | 22 | Cable Guy, The (1996) | 13 |
| Godfather, The (1972) | 64 | Rock, The (1996) | 66 | Return of the Jedi (1983) | 41 | Twister (1996) | 20 | Liar Liar (1997) | 13 |
| Princess Bride, The (1987) | 64 | Star Wars (1977) | 64 | Contact (1997) | 37 | Volcano (1997) | 19 | Jungle2Jungle (1997) | 12 |
| Pulp Fiction (1994) | 64 | Contact (1997) | 63 | Ransom (1996) | 37 | Devil’s Own, The (1997) | 18 | Kingpin (1996) | 12 |
| Silence of the Lambs, The (1991) | 62 | Silence of the Lambs, The (1991) | 58 | Truth About Cats & Dogs, The (1996) | 36 | Eraser (1996) | 18 | Event Horizon (1997) | 11 |
| Schindler’s List (1993) | 56 | Scream (1996) | 57 | Birdcage, The (1996) | 35 | Mars Attacks! (1996) | 18 | Independence Day (ID4) (1996) | 11 |
| Shawshank Redemption, The (1994) | 52 | Air Force One (1997) | 56 | Star Trek: First Contact (1996) | 35 | Dante’s Peak (1997) | 17 | Bio-Dome (1996) | 10 |
| Empire Strikes Back, The (1980) | 49 | Fugitive, The (1993) | 56 | Independence Day (ID4) (1996) | 34 | Saint, The (1997) | 17 | Crash (1996) | 10 |
movies_per_cluster_plot(2,
movie_list,
demos_movies_clustered)
par(mar = c(4, 4, .1, .1))
# occupation
occuptaion_plot(demos_movies_clustered %>% filter(cluster == 3,
.preserve = T))
# age
age_plot(demos_movies_clustered %>% filter(cluster == 3,
.preserve = T))
# gender split
gender_plot(demos_movies_clustered %>% filter(cluster == 3,
.preserve = T))
# geo_spatial
geo_plot(demos_movies_clustered %>% filter(cluster == 3,
.preserve = T))
# top 10 movies per score
top_movies_cluster(cluster=3,demos_movies_clustered) %>%
kbl() %>%
kable_styling(latex_options = "striped")
| title_5 | n_5 | title_4 | n_4 | title_3 | n_3 | title_2 | n_2 | title_1 | n_1 |
|---|---|---|---|---|---|---|---|---|---|
| English Patient, The (1996) | 16 | English Patient, The (1996) | 15 | English Patient, The (1996) | 11 | Chasing Amy (1997) | 5 | Liar Liar (1997) | 5 |
| Fargo (1996) | 16 | Air Force One (1997) | 14 | Volcano (1997) | 11 | Mrs. Doubtfire (1993) | 5 | Boogie Nights (1997) | 3 |
| Full Monty, The (1997) | 13 | Fargo (1996) | 14 | Evita (1996) | 9 | Boogie Nights (1997) | 4 | Event Horizon (1997) | 3 |
| Boot, Das (1981) | 12 | Godfather, The (1972) | 13 | Top Gun (1986) | 9 | Devil’s Advocate, The (1997) | 4 | Jungle2Jungle (1997) | 3 |
| L.A. Confidential (1997) | 12 | Titanic (1997) | 13 | Air Force One (1997) | 8 | George of the Jungle (1997) | 4 | Pulp Fiction (1994) | 3 |
| Lawrence of Arabia (1962) | 11 | Cool Hand Luke (1967) | 12 | Conspiracy Theory (1997) | 8 | Jerry Maguire (1996) | 4 | Blade Runner (1982) | 2 |
| Schindler’s List (1993) | 11 | To Kill a Mockingbird (1962) | 12 | Devil’s Own, The (1997) | 8 | Murder at 1600 (1997) | 4 | Broken Arrow (1996) | 2 |
| Sting, The (1973) | 11 | Evita (1996) | 11 | Mother (1996) | 8 | Sense and Sensibility (1995) | 4 | Conspiracy Theory (1997) | 2 |
| Amadeus (1984) | 10 | Star Wars (1977) | 11 | Birdcage, The (1996) | 7 | Soul Food (1997) | 4 | Crash (1996) | 2 |
| Bridge on the River Kwai, The (1957) | 10 | Dead Man Walking (1995) | 10 | In & Out (1997) | 7 | That Darn Cat! (1997) | 4 | Deceiver (1997) | 2 |
movies_per_cluster_plot(3,
movie_list,
demos_movies_clustered)
par(mar = c(4, 4, .1, .1))
# occupation
occuptaion_plot(demos_movies_clustered %>% filter(cluster == 4,
.preserve = T))
# age
age_plot(demos_movies_clustered %>% filter(cluster == 4,
.preserve = T))
# gender split
gender_plot(demos_movies_clustered %>% filter(cluster == 4,
.preserve = T))
# geo_spatial
geo_plot(demos_movies_clustered %>% filter(cluster == 4,
.preserve = T))
# top 10 movies per score
top_movies_cluster(cluster=4,demos_movies_clustered) %>%
kbl() %>%
kable_styling(latex_options = "striped")
| title_5 | n_5 | title_4 | n_4 | title_3 | n_3 | title_2 | n_2 | title_1 | n_1 |
|---|---|---|---|---|---|---|---|---|---|
| Star Wars (1977) | 44 | Return of the Jedi (1983) | 37 | Air Force One (1997) | 23 | Liar Liar (1997) | 17 | Beavis and Butt-head Do America (1996) | 11 |
| Godfather, The (1972) | 36 | Fugitive, The (1993) | 34 | Liar Liar (1997) | 21 | George of the Jungle (1997) | 14 | Evita (1996) | 7 |
| Schindler’s List (1993) | 36 | Raiders of the Lost Ark (1981) | 33 | Return of the Jedi (1983) | 21 | English Patient, The (1996) | 12 | Dante’s Peak (1997) | 6 |
| Silence of the Lambs, The (1991) | 35 | Contact (1997) | 31 | Rock, The (1996) | 21 | Scream (1996) | 12 | Leaving Las Vegas (1995) | 6 |
| Fargo (1996) | 34 | Toy Story (1995) | 31 | Sting, The (1973) | 20 | Batman Returns (1992) | 10 | Liar Liar (1997) | 6 |
| Raiders of the Lost Ark (1981) | 32 | Fargo (1996) | 30 | Contact (1997) | 19 | Saint, The (1997) | 10 | Scream (1996) | 6 |
| Casablanca (1942) | 30 | Amadeus (1984) | 29 | Back to the Future (1985) | 18 | Willy Wonka and the Chocolate Factory (1971) | 10 | Brazil (1985) | 5 |
| One Flew Over the Cuckoo’s Nest (1975) | 30 | Air Force One (1997) | 28 | Independence Day (ID4) (1996) | 18 | Conspiracy Theory (1997) | 9 | Clockwork Orange, A (1971) | 5 |
| Braveheart (1995) | 29 | Back to the Future (1985) | 27 | Indiana Jones and the Last Crusade (1989) | 18 | Crash (1996) | 9 | Event Horizon (1997) | 5 |
| It’s a Wonderful Life (1946) | 28 | Independence Day (ID4) (1996) | 27 | Jurassic Park (1993) | 18 | Grease (1978) | 9 | Grease (1978) | 5 |
movies_per_cluster_plot(4,
movie_list,
demos_movies_clustered)
Personifying each cluster, that is, giving each cluster a stereotypical identity that marketing teams can easily action, is a qualitative, iterative and complex process. Collaboration between data scientists, marketers and creatives (as well as sociologists, anthropologists or psychologists) is the key to creating outstanding, tailored-to-perfection personas.
The last step of the segmentation exercise is a multi-class classification model that can be put into production to score new users based on their film ratings and demographic data and assign them a cluster (or segment) label.
The cluster label is first one-hot encoded, so that each new column indicates whether the user belongs to a given cluster, rather than storing which cluster the user belongs to:
clustered_data <-
clustered_data %>%
fastDummies::dummy_cols(remove_selected_columns = T,
select_columns = "cluster",
remove_first_dummy = F,
)
write_csv(clustered_data,"clustered_data.csv")
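For reference, the same one-hot step can be sketched on the Python side with pandas (toy data; fastDummies produces the analogous cluster_0 … cluster_4 columns):

```python
import pandas as pd

# Toy labelled data (hypothetical)
df = pd.DataFrame({"user_id": [1, 2, 3], "cluster": [0, 2, 1]})

# One-hot encode the cluster label, mirroring fastDummies::dummy_cols
onehot = pd.get_dummies(df, columns=["cluster"], prefix="cluster")
print(onehot.columns.tolist())
```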
The labelled data set was split into training and test samples in a 70:30 ratio, and a sequential neural net with two hidden layers was fit using TensorFlow.
# load libraries
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
scaler = MinMaxScaler()
# read labelled data
clustered_data = pd.read_csv('clustered_data.csv')
clustered_data = clustered_data.iloc[:,1:].fillna(0)
# rescale movie rating
movies_rescaled = pd.DataFrame(
scaler.fit_transform(clustered_data[[col for col in clustered_data if col.startswith('mov')]]),
columns = [col for col in clustered_data if col.startswith('mov')])
clustered_data_rescaled = clustered_data[[col for col in clustered_data if not col.startswith('mov')]].join(movies_rescaled).drop('user_id', axis = 1)
clusters_labels = [col for col in clustered_data_rescaled if col.startswith('cluster')]
clusters = clustered_data_rescaled[clusters_labels]
# drop the one-hot labels from the features so the target is not leaked into the model
features = clustered_data_rescaled.drop(clusters_labels, axis = 1)
# Split sample in training / validation 70 - 30
xtrain, xtest, ytrain, ytest = train_test_split(
    features,
    clusters,
    test_size = 0.3,
    random_state=123)
xtrain, xtest, ytrain, ytest = xtrain.to_numpy(), xtest.to_numpy(), ytrain.to_numpy(), ytest.to_numpy()
xtrain, xtest = np.reshape(xtrain, (xtrain.shape[0],1,xtrain.shape[1])), np.reshape(xtest, (xtest.shape[0],1,xtest.shape[1]))
ytrain, ytest = np.reshape(ytrain, (ytrain.shape[0],1,ytrain.shape[1])), np.reshape(ytest, (ytest.shape[0],1,ytest.shape[1]))
# define and compile model
tf.keras.backend.clear_session()
regs = tf.keras.regularizers.l1_l2(l1=0.0001,l2=0.0001)
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation = 'relu', input_shape=(None,xtrain.shape[2]),kernel_regularizer=regs, bias_regularizer=regs),
tf.keras.layers.Dense(64, activation = 'relu', kernel_regularizer=regs, bias_regularizer=regs),
tf.keras.layers.Dense(5, activation = 'softmax')])
my_callbacks = [tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', patience=5, verbose=1),
tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,verbose=1, restore_best_weights=True)]
model.compile(loss='categorical_crossentropy', optimizer="Adam", metrics=["accuracy"])
# fit model
history = model.fit(
    x=xtrain,
    y=ytrain,
    validation_data=(xtest, ytest),
    epochs=50,
    verbose=1,
    batch_size=10,
    callbacks=my_callbacks)  # my_callbacks is already a list
Finally, to assess how good the model is at classifying users into their clusters, we can plot the loss and multi-class classification accuracy on the training and validation sets. High accuracy at both stages, with loss falling and accuracy rising at an even rate, is a sign of satisfactory results. The model is then ready to be put into production by your favourite machine learning ops or data engineering team.
CA = history.history['accuracy']
val_CA = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(CA))
plt.plot(epochs, CA, 'g', label='Training CategoricalAccuracy')
plt.plot(epochs, val_CA, 'b', label='Validation CategoricalAccuracy')
plt.title('Training and validation CategoricalAccuracy')
plt.legend(loc=0)
plt.draw()
fig1 = plt.gcf()
plt.show()
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend(loc=0)
plt.draw()
plt.show()
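In production, assigning a segment to a new user amounts to taking the argmax of the classifier’s softmax output; the final step can be sketched with plain NumPy (probabilities hypothetical):

```python
import numpy as np

# Hypothetical softmax output of the classifier for one new user
class_probs = np.array([0.05, 0.10, 0.70, 0.05, 0.10])

# The assigned segment is the most probable cluster
segment = int(np.argmax(class_probs))
print(f"assigned cluster: {segment}")
```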